<!DOCTYPE html>
<html lang="en" color-mode="light">

  <head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <meta name="keywords" content="JavaScript,Hexo" />
  <meta name="author" content="超能虾米" />
  <meta name="description" content="" />
  
  
  <title>
    
      ES中的倒排索引 
      
      
      |
    
     超能虾米的博客
  </title>

  
    <link rel="apple-touch-icon" href="/images/favicon.png">
    <link rel="icon" href="/images/favicon.png">
  

  <!-- Raleway-Font -->
  <link href="https://fonts.googleapis.com/css?family=Raleway&display=swap" rel="stylesheet">

  <!-- hexo site css -->
  <link rel="stylesheet" href="/css/main.css" />
  <link rel="stylesheet" href="//at.alicdn.com/t/font_1886449_67xjft27j1l.css" />
  <!-- 代码块风格 -->
  

  <!-- jquery3.3.1 -->
  
    <script defer type="text/javascript" src="/plugins/jquery.min.js"></script>
  

  <!-- fancybox -->
  
    <link href="/plugins/jquery.fancybox.min.css" rel="stylesheet">
    <script defer type="text/javascript" src="/plugins/jquery.fancybox.min.js"></script>
  
  
<script src="/js/fancybox.js"></script>


  

  

  <script>
    var html = document.documentElement
    const colorMode = localStorage.getItem('color-mode')
    if (colorMode) {
      document.documentElement.setAttribute('color-mode', colorMode)
    }
  </script>
<meta name="generator" content="Hexo 5.4.2"><link rel="alternate" href="/atom.xml" title="超能虾米的博客" type="application/atom+xml">
</head>


  <body>
    <div id="app">
      <div class="header">
  <div class="avatar">
    <a href="/">
      <!-- 头像取消懒加载，添加no-lazy -->
      
        <img src="/images/头像.JPEG" alt="">
      
    </a>
    <div class="nickname"><a href="/">超能虾米</a></div>
  </div>
  <div class="navbar">
    <ul>
      
        <li class="nav-item" data-path="/">
          <a href="/">主页</a>
        </li>
      
        <li class="nav-item" data-path="/archives/">
          <a href="/archives/">归档</a>
        </li>
      
        <li class="nav-item" data-path="/categories/">
          <a href="/categories/">分类</a>
        </li>
      
        <li class="nav-item" data-path="/tags/">
          <a href="/tags/">标签</a>
        </li>
      
        <li class="nav-item" data-path="/friends/">
          <a href="/friends/">朋友</a>
        </li>
      
        <li class="nav-item" data-path="/about/">
          <a href="/about/">关于我</a>
        </li>
      
    </ul>
  </div>
</div>


<script src="/js/activeNav.js"></script>



      <div class="flex-container">
        <!-- 文章详情页，展示文章具体内容，url形式：https://yoursite/文章标题/ -->
<!-- 同时为「标签tag」，「朋友friend」，「分类categories」，「关于about」页面的承载页面，具体展示取决于page.type -->


  <!-- LaTex Display -->

  
    <script async type="text/javascript" src="/plugins/mathjax/tex-chtml.js"></script>
  
  <script>
    MathJax = {
      tex: {
        inlineMath: [['$', '$'], ['\\(', '\\)']]
      }
    }
  </script>





  <!-- clipboard -->

  
    <script async type="text/javascript" src="/plugins/clipboard.min.js"></script>
  
  
<script src="/js/codeCopy.js"></script>







  

  

  

  
  <!-- 文章内容页 url形式：https://yoursite/文章标题/ -->
  <div class="container post-details" id="post-details">
    <div class="post-content">
      <div class="post-title">ES中的倒排索引</div>
      <div class="post-attach">
        <span class="post-pubtime">
          <i class="iconfont icon-updatetime mr-10" title="Update time"></i>
          2023-05-11 20:06:52
        </span>
        
              <span class="post-categories">
                <i class="iconfont icon-bookmark" title="Categories"></i>
                
                <span class="span--category">
                  <a href="/categories/%E5%90%8E%E7%AB%AF/" title="后端">
                    <b>#</b> 后端
                  </a>
                </span>
                
                <span class="span--category">
                  <a href="/categories/%E5%90%8E%E7%AB%AF/%E4%B8%AD%E9%97%B4%E4%BB%B6/" title="中间件">
                    <b>#</b> 中间件
                  </a>
                </span>
                
              </span>
          
              <span class="post-tags">
                <i class="iconfont icon-tags mr-10" title="Tags"></i>
                
                <span class="span--tag mr-8">
                  <a href="/tags/%E7%AE%97%E6%B3%95/" title="算法">
                    #算法
                  </a>
                </span>
                
              </span>
          
      </div>
      <div class="markdown-body">
        <p><img src="http://qiniu.c77544s.top/picgo/202305112007982.png" alt="image.png"><br>Lucene 的倒排索，增加了最左边的一层「字典树」term index，它不存储所有的单词，只存储单词前缀，通过字典树找到单词所在的块，也就是单词的大概位置，再在块里二分查找，找到对应的单词，再找到单词对应的文档列表。</p>
<p>当然，内存寸土寸金，能省则省，所以 Lucene 还用了 FST（Finite State Transducers）对它进一步压缩。</p>
<p>FST 是什么？这里就不展开了，这次重点想聊的，是最右边的 Posting List 的，别看它只是存一个文档 ID 数组，但是它在设计时，遇到的问题可不少。</p>
<p>原生的 Posting List 有两个痛点：</p>
<ul>
<li><strong>如何压缩以节省磁盘空间</strong></li>
<li><strong>如何快速求交并集（intersections and unions）</strong></li>
</ul>
<h2 id="如何压缩节省磁盘空间"><a href="#如何压缩节省磁盘空间" class="headerlink" title="如何压缩节省磁盘空间"></a>如何压缩节省磁盘空间</h2><p><strong>Step 1：Delta-encode —— 增量编码</strong><br>我们只记录元素与元素之间的增量，于是数组变成了：<code>[73, 227, 2, 30, 11, 29]</code></p>
<p><strong>Step 2：Split into blocks —— 分割成块</strong><br>Lucene里每个块是 256 个文档 ID，这样可以保证每个块，增量编码后，每个元素都不会超过 256（1 byte）.<br>为了方便演示，我们假设每个块是 3 个文档 ID：<code>[73, 227, 2], [30, 11, 29]</code></p>
<p><strong>Step 3：Bit packing —— 按需分配空间</strong><br>对于第一个块，<code>[73, 227, 2]</code>，最大元素是227，需要 8 bits，好，那我给你这个块的每个元素，都分配 8 bits的空间。<br>但是对于第二个块，<code>[30, 11, 29]</code>，最大的元素才30，只需要 5 bits，那我就给你每个元素，只分配 5 bits 的空间，足矣。<br> <img src="http://qiniu.c77544s.top/picgo/202305112015943.png" alt="image.png"></p>
<h2 id="如何求交并集"><a href="#如何求交并集" class="headerlink" title="如何求交并集"></a>如何求交并集</h2><p><strong>Option 1: Bitmap</strong><br>假设有这样一个数组：<code>[3,6,7,10]</code><br>那么我们可以这样来表示：<code>[0,0,1,0,0,1,1,0,0,1]</code><br><strong>用 0 表示角标对应的数字不存在，用 1 表示存在。</strong></p>
<p>这样带来了两个好处：</p>
<ul>
<li>节省空间：既然我们只需要0和1，那每个文档 ID 就只需要 1 bit，还是假设有 100M 个文档，那只需要 100M bits &#x3D; 100M * 1&#x2F;8 bytes &#x3D; 12.5 MB，比之前用 Integer 数组 的 200 MB，优秀太多</li>
<li>运算更快：0 和 1，天然就适合进行位运算，求交集，「与」一下，求并集，「或」一下，一切都回归到计算机的起点</li>
</ul>
<p><strong>Option 2: Roaring Bitmaps</strong><br>bitmap 有个硬伤，就是不管有多少个文档，占用的空间都是一样的。<br>举一个极端的例子，有一个数组，里面只有两个文档 ID：<code>[0, 65535]</code><br>用 Bitmap，要怎么表示？<code>[1,0,0,0,…,(超级多个0),…,0,0,1]</code></p>
<p>可见在文档数量不多的时候，使用 Integer 数组更加节省内存。</p>
<p>我们来算一下临界值，很简单，无论文档数量多少，bitmap都需要 8192 bytes，而 Integer 数组则和文档数量成线性相关，每个文档 ID 占 2 bytes，所以：8192 &#x2F; 2 &#x3D; 4096</p>
<p>当文档数量少于 4096 时，用 Integer 数组，否则，用 bitmap.</p>
<blockquote>
<p>这里补充一下 Roaring bitmaps 和之前讲的 Frame Of Reference 的关系。  </p>
<p>Frame Of Reference 是压缩数据，减少磁盘占用空间，所以当我们从磁盘取数据时，也需要一个反向的过程，即解压，解压后才有我们上面看到的这样子的文档 ID 数组：<code>[73, 300, 302, 303, 343, 372]</code> ，接着我们需要对数据进行处理，求交集或者并集，这时候数据是需要放到内存进行处理的，我们有三个这样的数组，这些数组可能很大，而内存空间比磁盘还宝贵，于是需要更强有力的压缩算法，同时还要有利于快速的求交并集，于是有了 Roaring Bitmaps 算法。  </p>
<p>另外，Lucene 还会把从磁盘取出来的数据，通过 Roaring bitmaps 处理后，缓存到内存中，Lucene 称之为 filter cache. 这里补充一下 Roaring bitmaps 和之前讲的 Frame Of Reference 的关系。</p>
</blockquote>

      </div>
      
        <div class="prev-or-next">
          <div class="post-foot-next">
            
              <a href="/2023/05/11/%E5%90%8E%E7%AB%AF/%E6%95%B0%E6%8D%AE%E5%BA%93/SQL%E8%B5%B0%E4%BA%86%E7%B4%A2%E5%BC%95%E8%BF%98%E6%98%AF%E5%BE%88%E6%85%A2%EF%BC%9F/" target="_self">
                <i class="iconfont icon-chevronleft"></i>
                <span>Prev</span>
              </a>
            
          </div>
          <div class="post-attach">
            <span class="post-pubtime">
              <i class="iconfont icon-updatetime mr-10" title="Update time"></i>
              2023-05-11 20:06:52
            </span>
            
                  <span class="post-categories">
                    <i class="iconfont icon-bookmark" title="Categories"></i>
                    
                    <span class="span--category">
                      <a href="/categories/%E5%90%8E%E7%AB%AF/" title="后端">
                        <b>#</b> 后端
                      </a>
                    </span>
                    
                    <span class="span--category">
                      <a href="/categories/%E5%90%8E%E7%AB%AF/%E4%B8%AD%E9%97%B4%E4%BB%B6/" title="中间件">
                        <b>#</b> 中间件
                      </a>
                    </span>
                    
                  </span>
              
                  <span class="post-tags">
                    <i class="iconfont icon-tags mr-10" title="Tags"></i>
                    
                    <span class="span--tag mr-8">
                      <a href="/tags/%E7%AE%97%E6%B3%95/" title="算法">
                        #算法
                      </a>
                    </span>
                    
                  </span>
              
          </div>
          <div class="post-foot-prev">
            
              <a href="/2023/05/12/%E5%90%8E%E7%AB%AF/%E4%B8%AD%E9%97%B4%E4%BB%B6/Kafka%E7%AC%94%E8%AE%B0/" target="_self">
                <span>Next</span>
                <i class="iconfont icon-chevronright"></i>
              </a>
            
          </div>
        </div>
      
    </div>
    
  <div id="btn-catalog" class="btn-catalog">
    <i class="iconfont icon-catalog"></i>
  </div>
  <div class="post-catalog hidden" id="catalog">
    <div class="title">Contents</div>
    <div class="catalog-content">
      
        <ol class="toc"><li class="toc-item toc-level-2"><a class="toc-link" href="#%E5%A6%82%E4%BD%95%E5%8E%8B%E7%BC%A9%E8%8A%82%E7%9C%81%E7%A3%81%E7%9B%98%E7%A9%BA%E9%97%B4"><span class="toc-text">如何压缩节省磁盘空间</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E5%A6%82%E4%BD%95%E6%B1%82%E4%BA%A4%E5%B9%B6%E9%9B%86"><span class="toc-text">如何求交并集</span></a></li></ol>
      
    </div>
  </div>

  
<script src="/js/catalog.js"></script>




    
      <div class="comments-container">
        




  
    <script async type="text/javascript" src="/plugins/valine.min.js" onload="loadValineSuc(this)"></script>
  

  <div id="vcomments"></div>

  <script>
    function loadValineSuc() {
      new Valine({
        el: '#vcomments',
        appId: 'H2QqshKF8elT2Si5bfKynT9h-9Nh9j0Va',
        appKey: 'WgtVppOWkTZ8fWNReNI1Ymf7',
        placeholder: '有问题请留言吧!',
        avatar: 'retro',
        lang: 'en'
      })
    }
  </script>

    <style>
      .comments-container .v .vempty {
        display: none!important;
      }
    </style>




      </div>
    
  </div>


        
<div class="footer">
  <div class="social">
    <ul>
      
        <li>
          <a title="github" target="_blank" rel="noopener" href="https://github.com/zchengsite/hexo-theme-oranges">
            <i class="iconfont icon-github"></i>
          </a>
        </li>
      
        <li>
          <a title="email" target="_blank" rel="noopener" href="https://c77544s.github.io/about/">
            <i class="iconfont icon-envelope"></i>
          </a>
        </li>
      
        <li>
          <a title="wechat" target="_blank" rel="noopener" href="https://c77544s.github.io/about/">
            <i class="iconfont icon-wechat"></i>
          </a>
        </li>
      
        <li>
          <a title="rss" href="/atom.xml">
            <i class="iconfont icon-rss"></i>
          </a>
        </li>
      
    </ul>
  </div>
  
    
    <div class="footer-more">
      
        <a target="_blank" rel="noopener" href="https://github.com/zchengsite/hexo-theme-oranges">Copyright © 2023 Oranges</a>
        
    </div>
  
    
    <div class="footer-more">
      
        <a target="_blank" rel="noopener" href="https://github.com/zchengsite/hexo-theme-oranges">Theme by Oranges | Powered by Hexo</a>
        
    </div>
  
  
</div>

      </div>

      <div class="tools-bar">
        <div class="back-to-top tools-bar-item hidden">
  <a href="javascript: void(0)">
    <i class="iconfont icon-chevronup"></i>
  </a>
</div>


<script src="/js/backtotop.js"></script>



        
  <div class="search-icon tools-bar-item" id="search-icon">
    <a href="javascript: void(0)">
      <i class="iconfont icon-search"></i>
    </a>
  </div>

  <div class="search-overlay hidden">
    <div class="search-content" tabindex="0">
      <div class="search-title">
        <span class="search-icon-input">
          <a href="javascript: void(0)">
            <i class="iconfont icon-search"></i>
          </a>
        </span>
        
          <input type="text" class="search-input" id="search-input" placeholder="搜索...">
        
        <span class="search-close-icon" id="search-close-icon">
          <a href="javascript: void(0)">
            <i class="iconfont icon-close"></i>
          </a>
        </span>
      </div>
      <div class="search-result" id="search-result"></div>
    </div>
  </div>

  <script type="text/javascript">
    var inputArea = document.querySelector("#search-input")
    var searchOverlayArea = document.querySelector(".search-overlay")

    inputArea.onclick = function() {
      getSearchFile()
      this.onclick = null
    }

    inputArea.onkeydown = function() {
      if(event.keyCode == 13)
        return false
    }

    function openOrHideSearchContent() {
      let isHidden = searchOverlayArea.classList.contains('hidden')
      if (isHidden) {
        searchOverlayArea.classList.remove('hidden')
        document.body.classList.add('hidden')
        // inputArea.focus()
      } else {
        searchOverlayArea.classList.add('hidden')
        document.body.classList.remove('hidden')
      }
    }

    function blurSearchContent(e) {
      if (e.target === searchOverlayArea) {
        openOrHideSearchContent()
      }
    }

    document.querySelector("#search-icon").addEventListener("click", openOrHideSearchContent, false)
    document.querySelector("#search-close-icon").addEventListener("click", openOrHideSearchContent, false)
    searchOverlayArea.addEventListener("click", blurSearchContent, false)

    var searchFunc = function (path, search_id, content_id) {
      'use strict';
      var $input = document.getElementById(search_id);
      var $resultContent = document.getElementById(content_id);
      $resultContent.innerHTML = "<ul><span class='local-search-empty'>First search, index file loading, please wait...<span></ul>";
      $.ajax({
        // 0x01. load xml file
        url: path,
        dataType: "xml",
        success: function (xmlResponse) {
          // 0x02. parse xml file
          var datas = $("entry", xmlResponse).map(function () {
            return {
              title: $("title", this).text(),
              content: $("content", this).text(),
              url: $("url", this).text()
            };
          }).get();
          $resultContent.innerHTML = "";

          $input.addEventListener('input', function () {
            // 0x03. parse query to keywords list
            var str = '<ul class=\"search-result-list\">';
            var keywords = this.value.trim().toLowerCase().split(/[\s\-]+/);
            $resultContent.innerHTML = "";
            if (this.value.trim().length <= 0) {
              return;
            }
            // 0x04. perform local searching
            datas.forEach(function (data) {
              var isMatch = true;
              var content_index = [];
              if (!data.title || data.title.trim() === '') {
                data.title = "Untitled";
              }
              var orig_data_title = data.title.trim();
              var data_title = orig_data_title.toLowerCase();
              var orig_data_content = data.content.trim().replace(/<[^>]+>/g, "");
              var data_content = orig_data_content.toLowerCase();
              var data_url = data.url;
              var index_title = -1;
              var index_content = -1;
              var first_occur = -1;
              // only match artiles with not empty contents
              if (data_content !== '') {
                keywords.forEach(function (keyword, i) {
                  index_title = data_title.indexOf(keyword);
                  index_content = data_content.indexOf(keyword);

                  if (index_title < 0 && index_content < 0) {
                    isMatch = false;
                  } else {
                    if (index_content < 0) {
                      index_content = 0;
                    }
                    if (i == 0) {
                      first_occur = index_content;
                    }
                    // content_index.push({index_content:index_content, keyword_len:keyword_len});
                  }
                });
              } else {
                isMatch = false;
              }
              // 0x05. show search results
              if (isMatch) {
                str += "<li><a href='" + data_url + "' class='search-result-title'>" + orig_data_title + "</a>";
                var content = orig_data_content;
                if (first_occur >= 0) {
                  // cut out 100 characters
                  var start = first_occur - 20;
                  var end = first_occur + 80;

                  if (start < 0) {
                    start = 0;
                  }

                  if (start == 0) {
                    end = 100;
                  }

                  if (end > content.length) {
                    end = content.length;
                  }

                  var match_content = content.substr(start, end);

                  // highlight all keywords
                  keywords.forEach(function (keyword) {
                    var regS = new RegExp(keyword, "gi");
                    match_content = match_content.replace(regS, "<span class=\"search-keyword\">" + keyword + "</span>");
                  });

                  str += "<p class=\"search-result-abstract\">" + match_content + "...</p>"
                }
                str += "</li>";
              }
            });
            str += "</ul>";
            if (str.indexOf('<li>') === -1) {
              return $resultContent.innerHTML = "<ul><span class='local-search-empty'>No result<span></ul>";
            }
            $resultContent.innerHTML = str;
          });
        },
        error: function(xhr, status, error) {
          $resultContent.innerHTML = ""
          if (xhr.status === 404) {
            $resultContent.innerHTML = "<ul><span class='local-search-empty'>The search.xml file was not found, please refer to：<a href='https://github.com/zchengsite/hexo-theme-oranges#configuration' target='_black'>configuration</a><span></ul>";
          } else {
            $resultContent.innerHTML = "<ul><span class='local-search-empty'>The request failed, Try to refresh the page or try again later.<span></ul>";
          }
        }
      });
      $(document).on('click', '#search-close-icon', function() {
        $('#search-input').val('');
        $('#search-result').html('');
      });
    }

    var getSearchFile = function() {
        var path = "/search.xml";
        searchFunc(path, 'search-input', 'search-result');
    }
  </script>




        
  <div class="tools-bar-item theme-icon" id="switch-color-scheme">
    <a href="javascript: void(0)">
      <i id="theme-icon" class="iconfont icon-moon"></i>
    </a>
  </div>

  
<script src="/js/colorscheme.js"></script>





        
  
    <div class="share-icon tools-bar-item">
      <a href="javascript: void(0)" id="share-icon">
        <i class="iconfont iconshare"></i>
      </a>
      <div class="share-content hidden">
        
          <a class="share-item" href="https://twitter.com/intent/tweet?text=' + ES%E4%B8%AD%E7%9A%84%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95 + '&url=' + http%3A%2F%2Fc77544s.gitee.io%2F2023%2F05%2F11%2F%25E5%2590%258E%25E7%25AB%25AF%2F%25E4%25B8%25AD%25E9%2597%25B4%25E4%25BB%25B6%2FES%25E4%25B8%25AD%25E7%259A%2584%25E5%2580%2592%25E6%258E%2592%25E7%25B4%25A2%25E5%25BC%2595%2F + '" target="_blank" title="Twitter">
            <i class="iconfont icon-twitter"></i>
          </a>
        
        
          <a class="share-item" href="https://www.facebook.com/sharer.php?u=http://c77544s.gitee.io/2023/05/11/%E5%90%8E%E7%AB%AF/%E4%B8%AD%E9%97%B4%E4%BB%B6/ES%E4%B8%AD%E7%9A%84%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95/" target="_blank" title="Facebook">
            <i class="iconfont icon-facebooksquare"></i>
          </a>
        
      </div>
    </div>
  
  
<script src="/js/shares.js"></script>



      </div>
    </div>
  </body>
</html>
